Significant Feature Clustering

نویسنده

John Samuel Whissell

چکیده

In this thesis, we present a new clustering algorithm we call Significance Feature Clustering, which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors which contain values of -1, 0, or 1. These values represent whether a word is significantly negative, neutral, or significantly positive, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors, then these tf-idf vectors are transformed to significance vectors using a parameter α, where α controls the mapping -1, 0, or 1 for each vector entry. SFC clusters agglomeratively, with each document’s significance vector representing a cluster of size one containing just the document, and iteratively merges the two clusters that exhibit the most similar average using cosine similarity. We show that by using a good α value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and therefore correspondingly yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an α as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as internal similarity, so that it can be applied to a tree of clusters rather than a set, but it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call maximum tree internal similarity, we show that a close to optimal text clustering may be picked from any number of clusterings created by different α’s. The automatically selected clusterings have qualities that are close to that of a well-known and powerful hierarchical clustering algorithm.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data be introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...

متن کامل

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

متن کامل

Steel Consumption Forecasting Using Nonlinear Pattern Recognition Model Based on Self-Organizing Maps

Steel consumption is a critical factor affecting pricing decisions and a key element to achieve sustainable industrial development. Forecasting future trends of steel consumption based on analysis of nonlinear patterns using artificial intelligence (AI) techniques is the main purpose of this paper. Because there are several features affecting target variable which make the analysis of relations...

متن کامل

طبقه‌بندی بیماری پارکینسون بر مبنای شاخص‌های درون-ناحیه‌ای و بین-ناحیه‌ای شبکه حرکتی مغز با استفاده از دادگان fMRI حالت استراحت

Parkinson’s disease (PD) is a progressive neurological disorder characterized by tremor, rigidity, and slowness of movement. Recent studies on investigation of the brain function show that there are spontaneous fluctuations between regions at rest as resting state network affected in various disorders. In this paper, we used amplitude of low frequency fluctuation (ALFF) for the study of intra-r...

متن کامل

MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection

Multi-label classification has gained significant attention during recent years, due to the increasing number of modern applications associated with multi-label data. Despite its short life, different approaches have been presented to solve the task of multi-label classification. LIFT is a multi-label classifier which utilizes a new strategy to multi-label learning by leveraging label-specific ...

متن کامل

Supervised Feature Extraction of Face Images for Improvement of Recognition Accuracy

Dimensionality reduction methods transform or select a low dimensional feature space to efficiently represent the original high dimensional feature space of data. Feature reduction techniques are an important step in many pattern recognition problems in different fields especially in analyzing of high dimensional data. Hyperspectral images are acquired by remote sensors and human face images ar...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2006

Significant Feature Clustering

نویسنده

چکیده

منابع مشابه

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

Steel Consumption Forecasting Using Nonlinear Pattern Recognition Model Based on Self-Organizing Maps

طبقه‌بندی بیماری پارکینسون بر مبنای شاخص‌های درون-ناحیه‌ای و بین-ناحیه‌ای شبکه حرکتی مغز با استفاده از دادگان fMRI حالت استراحت

MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection

Supervised Feature Extraction of Face Images for Improvement of Recognition Accuracy

عنوان ژورنال:

اشتراک گذاری